Empirical Q-Value Iteration

Authors

Abstract

We propose a new simple and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov decision process (MDP) when the transition kernels are unknown. Unlike classical learning algorithms for MDPs, such as Q-learning and actor-critic algorithms, this algorithm does not depend on a stochastic approximation-based method. We show that our algorithm, which we call empirical Q-value iteration, converges to the optimal Q-value function. We also give a rate of convergence, or nonasymptotic sample complexity bound, and show that an asynchronous (or online) version of the algorithm will also work. Preliminary experimental results suggest a faster rate of convergence to a ballpark estimate compared with stochastic approximation-based algorithms.
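The abstract does not spell the update out, but the empirical Bellman iteration it describes can be sketched roughly as follows. This is a hedged illustration only: the simulator interface `sample_next_state`, the `cost` function, and all parameter values are assumptions, not the authors' implementation.

```python
import numpy as np

def empirical_q_iteration(sample_next_state, cost, n_states, n_actions,
                          gamma=0.95, n_samples=50, n_iters=200, seed=0):
    """Sketch of an empirical Q-value iteration for a discounted-cost MDP.

    The expectation in the Bellman operator is replaced by an empirical
    average over `n_samples` simulated next states, so no step sizes or
    stochastic-approximation updates are involved (illustrative only).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        Q_next = np.empty_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                # Draw i.i.d. next states from the unknown kernel via a simulator.
                xs = [sample_next_state(s, a, rng) for _ in range(n_samples)]
                # Empirical average of the one-step Bellman (cost-minimizing) targets.
                Q_next[s, a] = np.mean([cost(s, a) + gamma * Q[x].min() for x in xs])
        Q = Q_next
    return Q
```

Because every iteration averages a fresh batch of simulated next states instead of folding single samples into a running iterate with diminishing step sizes, such a scheme avoids the stochastic-approximation machinery of Q-learning, which is the contrast the abstract draws.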


Similar resources

Boosted Fitted Q-Iteration

This paper is about the study of B-FQI, an Approximated Value Iteration (AVI) algorithm that exploits a boosting procedure to estimate the action-value function in reinforcement learning problems. B-FQI is an iterative off-line algorithm that, given a dataset of transitions, builds an approximation of the optimal action-value function by summing the approximations of the Bellman residuals acros...

Full text
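The B-FQI snippet above is cut off, so the following is only a toy sketch of the general idea it names: fit a weak regressor to the empirical Bellman residual on a fixed dataset of transitions and add it to the running action-value estimate. The transition format, the scikit-learn regression trees, and every parameter are assumptions rather than the B-FQI estimator itself.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_residual_fqi(transitions, n_actions, gamma=0.99, n_iters=20):
    """Toy Bellman-residual boosting sketch (illustrative, not B-FQI itself).

    `transitions` is a list of (state, action, reward, next_state) tuples,
    with states given as feature vectors or scalars.
    """
    states = np.array([t[0] for t in transitions])
    actions = np.array([t[1] for t in transitions])
    rewards = np.array([t[2] for t in transitions])
    next_states = np.array([t[3] for t in transitions])
    X = np.column_stack([states, actions])   # regression inputs (s, a)

    learners = []                            # the boosted ensemble of residual fits

    def q_values(s_batch):
        # Sum of all residual regressors evaluated at (s, a) for every action.
        q = np.zeros((len(s_batch), n_actions))
        for a in range(n_actions):
            Xa = np.column_stack([s_batch, np.full(len(s_batch), a)])
            for h in learners:
                q[:, a] += h.predict(Xa)
        return q

    for _ in range(n_iters):
        # Empirical Bellman target and residual on the fixed dataset.
        q_sa = q_values(states)[np.arange(len(actions)), actions]
        target = rewards + gamma * q_values(next_states).max(axis=1)
        residual = target - q_sa
        # Fit a weak learner to the residual and add it to the estimate.
        learners.append(DecisionTreeRegressor(max_depth=3).fit(X, residual))
    return q_values
```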

Factored Value Iteration Converges

In this paper we propose a novel algorithm, factored value iteration (FVI), for the approximate solution of factored Markov decision processes (fMDPs). The traditional approximate value iteration algorithm is modified in two ways. For one, the least-squares projection operator is modified so that it does not increase max-norm, and thus preserves convergence. The other modification is that we un...

Full text
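As an illustration of the property highlighted in the FVI snippet above, the sketch below runs approximate value iteration with a least-squares projection rescaled so that its induced max-norm is at most one, which keeps the composed update a contraction. The factored representation itself is omitted, and the concrete rescaling is an assumption, not the paper's construction.

```python
import numpy as np

def max_norm_safe_avi(P, c, H, gamma=0.9, n_iters=100):
    """Approximate value iteration with a max-norm-nonexpansive projection.

    P: (n_actions, n_states, n_states) transition matrices
    c: (n_states, n_actions) one-step costs
    H: (n_states, k) feature matrix for the linear value approximation
    """
    # Standard least-squares projection onto span(H) ...
    Pi = H @ np.linalg.pinv(H)
    # ... rescaled so its induced infinity-norm is at most 1, so the projected
    # Bellman update remains a gamma-contraction in max-norm.
    Pi /= max(1.0, np.abs(Pi).sum(axis=1).max())

    v = np.zeros(P.shape[1])
    for _ in range(n_iters):
        # Exact Bellman backup (min over actions in a cost formulation).
        backups = c + gamma * np.einsum('aij,j->ia', P, v)
        v = Pi @ backups.min(axis=1)
    return v
```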

Value Pursuit Iteration

Value Pursuit Iteration (VPI) is an approximate value iteration algorithm that finds a close to optimal policy for reinforcement learning problems with large state spaces. VPI has two main features: First, it is a nonparametric algorithm that finds a good sparse approximation of the optimal value function given a dictionary of features. The algorithm is almost insensitive to the number of irrel...

Full text
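The VPI snippet mentions a sparse, dictionary-based fit of the value function inside approximate value iteration; one hedged way to illustrate that combination is with orthogonal matching pursuit as the sparse regressor, as below. The choice of OMP and all parameters are assumptions, not the algorithm from the paper.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def pursuit_style_avi(P, r, Phi, gamma=0.99, n_atoms=10, n_iters=50):
    """Approximate value iteration with a sparse fit over a feature dictionary.

    P:   (n_actions, n_states, n_states) transition matrices
    r:   (n_states, n_actions) rewards
    Phi: (n_states, n_features) dictionary of candidate features
    """
    v = np.zeros(P.shape[1])
    for _ in range(n_iters):
        # Bellman backup evaluated on all states (max over actions for rewards).
        targets = (r + gamma * np.einsum('aij,j->ia', P, v)).max(axis=1)
        # Greedy sparse regression of the backup onto the dictionary.
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_atoms).fit(Phi, targets)
        v = omp.predict(Phi)
    return v
```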

External Memory Value Iteration

We propose a unified approach to disk-based search for deterministic, non-deterministic, and probabilistic (MDP) settings. We provide the design of an external Value Iteration algorithm that performs at most O(l_G · scan(|E|) + t_max · sort(|E|)) I/Os, where l_G is the length of the largest back-edge in the breadth-first search graph G having |E| edges, t_max is the maximum number of iterations, an...

Full text
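To make the "model on disk, scanned once per iteration" idea concrete, here is a toy sketch that streams an edge list from a CSV file on every sweep. The file format and this simple full-scan scheme are assumptions; they do not reproduce the sort-based external algorithm with back-edge analysis described in the entry above.

```python
import csv
import numpy as np

def disk_streamed_value_iteration(edge_file, n_states, n_actions,
                                  gamma=0.95, n_iters=100):
    """Toy value iteration over an MDP whose edge list lives on disk.

    Each row of `edge_file` is: state, action, next_state, probability, cost.
    Assumes every (state, action) pair appears with probabilities summing to 1.
    """
    v = np.zeros(n_states)
    for _ in range(n_iters):
        q = np.zeros((n_states, n_actions))
        with open(edge_file, newline='') as f:
            # One sequential scan of the edge list per iteration (~scan(|E|) I/Os).
            for s, a, s2, p, c in csv.reader(f):
                s, a, s2 = int(s), int(a), int(s2)
                q[s, a] += float(p) * (float(c) + gamma * v[s2])
        v = q.min(axis=1)
    return v
```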


Journal

Journal title: Stochastic Systems

Year: 2021

ISSN: 1946-5238

DOI: https://doi.org/10.1287/stsy.2019.0062